Data
| Column Name | Data Type | Description |
|---|---|---|
| User id | Integer | Unique user identifier |
| Loan category | String | Loan category (categorical) |
| Amount | Integer | Loan amount |
| Interest Rate | Float | Interest rate |
| Tenure | Integer | Loan tenure |
| Employment type | String | Employment type (categorical) |
| Tier of Employment | String | Employment tier classification (categorical, ordinal) |
| Industry | String | Industry type (categorical) |
| Role | String | Role description (categorical) |
| Work Experience | String | Work experience bracket (categorical, ordinal) |
| Total Income(PA) | Integer | Total annual income |
| Gender | String | Gender of the user (categorical) |
| Married | String | Marital status (categorical) |
| Dependents | Integer | Number of dependents |
| Home | String | Housing category (categorical) |
| Pincode | String | Pincode information (masked) |
| Social Profile | String | Social profile of the user (categorical) |
| Is_verified | String | Verification status (categorical) |
| Delinq_2yrs | Integer | Number of delinquencies |
| Total Payment | Integer | Total payment received |
| Received Principal | Integer | Principal amount received |
| Interest Received | Integer | Interest amount received |
| Number of loans | Integer | Number of loans taken |
| Defaulter | String | Loan defaulter classification (target, categorical) |
The column to be predicted is "Defaulter".
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')
from Source.setup import *
from Source.eda import *
from Source.machine_learning2 import *
# Specify the file path of the Excel file containing the dataset
path = '../Input/Credit_Risk_Dataset.xlsx'
# Call the function to read the Excel data
sheet_names = ['loan_information', 'Employment', 'Personal_information', 'Other_information']
dfs = read_excel_data(path, sheet_names)
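`read_excel_data` is imported from `Source.setup` and its implementation is not shown here; a minimal sketch of what such a helper might look like (an assumption, not the project's actual code) is:

```python
import pandas as pd

def read_excel_data(path, sheet_names):
    """Read the listed sheets from an Excel workbook and return them as a
    list of DataFrames, in the same order as sheet_names."""
    # Passing a list to sheet_name makes pandas return a dict keyed by sheet name
    sheets = pd.read_excel(path, sheet_name=sheet_names)
    return [sheets[name] for name in sheet_names]
```

Returning the frames in the order of `sheet_names` is what lets the next cell unpack `dfs[0]` … `dfs[3]` positionally.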
loan_information = dfs[0]
employment = dfs[1]
personal_information = dfs[2]
other_information = dfs[3]
employment
| User id | Employmet type | Tier of Employment | Industry | Role | Work Experience | Total Income(PA) | |
|---|---|---|---|---|---|---|---|
| 0 | 7013527 | Salaried | B | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGAZSEWdkcndwkcnCCM | 1-2 | 125000.0 |
| 1 | 7014291 | Self - Employeed | D | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGNCSEWdkcndwkcnCCM | 10+ | 61000.0 |
| 2 | 7014327 | missing | NaN | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGNYSEWdkcndwkcnCCM | 5-10 | 100000.0 |
| 3 | 7014304 | NaN | NaN | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGCASEWdkcndwkcnCCM | 2-3 | 30000.0 |
| 4 | 7031995 | NaN | NaN | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGCASEWdkcndwkcnCCM | 10+ | 65000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 134412 | 55993166 | NaN | NaN | 0 | KHMbckjadbckIFGTNSEWdkcndwkcnCCM | 0 | 100000.0 |
| 134413 | 56263229 | NaN | NaN | 0 | KHMbckjadbckIFGOHSEWdkcndwkcnCCM | 0 | 27000.0 |
| 134414 | 55992716 | NaN | NaN | 0 | KHMbckjadbckIFGCASEWdkcndwkcnCCM | 0 | 53000.0 |
| 134415 | 56363313 | NaN | NaN | 0 | KHMbckjadbckIFGVASEWdkcndwkcnCCM | 0 | 75400.0 |
| 134416 | 56122051 | Salaried | D | 0 | KHMbckjadbckIFGNVSEWdkcndwkcnCCM | 0 | 78000.0 |
134417 rows × 7 columns
# Merge 'loan_information' and 'Employment' on the user id ('User_id' on the left, 'User id' on the right)
merged_df = pd.merge(loan_information, employment, left_on='User_id', right_on='User id')
# Merge the result with 'personal_information' on the user id
merged_df = pd.merge(merged_df, personal_information, left_on='User_id', right_on='User id')
# Merge the result with 'other_information' on the user id
merged_df = pd.merge(merged_df, other_information, left_on='User_id', right_on='User_id')
df = merged_df
# Display the first few rows of the merged dataframe
df.head()
| User_id | Loan Category | Amount | Interest Rate | Tenure(years) | User id_x | Employmet type | Tier of Employment | Industry | Role | ... | Home | Pincode | Social Profile | Is_verified | Delinq_2yrs | Total Payement | Received Principal | Interest Received | Number of loans | Defaulter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7013527 | Consolidation | 55884.0 | 11.84 | 6 | 7013527 | Salaried | B | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGAZSEWdkcndwkcnCCM | ... | rent | XX852X | No | NaN | 0 | 1824.150000 | 971.46 | 852.69 | 0 | 1 |
| 1 | 7014291 | Consolidation | 55511.0 | 16.94 | 4 | 7014291 | Self - Employeed | D | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGNCSEWdkcndwkcnCCM | ... | mortgage | XX286X | NaN | Source Verified | 0 | 22912.532998 | 18000.00 | 4912.53 | 0 | 0 |
| 2 | 7014327 | Consolidation | 12289.0 | 11.84 | 6 | 7014327 | missing | NaN | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGNYSEWdkcndwkcnCCM | ... | own | XX113X | No | NaN | 0 | 7800.440000 | 4489.76 | 3310.68 | 0 | 1 |
| 3 | 7014304 | Credit Card | 29324.0 | 14.71 | 4 | 7014304 | NaN | NaN | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGCASEWdkcndwkcnCCM | ... | rent | XX941X | Yes | NaN | 0 | 6672.050000 | 5212.29 | 1459.76 | 0 | 0 |
| 4 | 7031995 | Credit Card | 30252.0 | 14.71 | 4 | 7031995 | NaN | NaN | mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU... | KHMbckjadbckIFGCASEWdkcndwkcnCCM | ... | rent | XX913X | No | Verified | 0 | 11793.001345 | 10000.00 | 1793.00 | 0 | 0 |
5 rows × 26 columns
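Note that the merged frame has 143,727 rows even though only 133,748 user ids are unique, so some ids match more than one row. pandas can surface this during the merge via the `validate` argument; a toy illustration:

```python
import pandas as pd

# Toy frames with a repeated key on the right, mimicking duplicate user ids
left = pd.DataFrame({"User_id": [1, 2, 3], "Amount": [100, 200, 300]})
right = pd.DataFrame({"User id": [1, 2, 2], "Tier": ["A", "B", "C"]})

merged = pd.merge(left, right, left_on="User_id", right_on="User id")
print(len(merged))  # 3 rows: user 2 matched twice, user 3 dropped (inner join)

# validate='one_to_one' raises MergeError if either key contains duplicates
try:
    pd.merge(left, right, left_on="User_id", right_on="User id",
             validate="one_to_one")
except pd.errors.MergeError as e:
    print("duplicate keys detected:", e)
```

Running the real merges with `validate` would make the duplicate user ids an explicit, inspectable event rather than a silent row-count change.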
Goal: To understand the data
# Display the data type of each column to check whether any column has the wrong type.
df.dtypes
User_id                  int64
Loan Category           object
Amount                 float64
Interest Rate          float64
Tenure(years)            int64
User id_x                int64
Employmet type          object
Tier of Employment      object
Industry                object
Role                    object
Work Experience         object
Total Income(PA)       float64
User id_y                int64
Gender                  object
Married                 object
Dependents               int64
Home                    object
Pincode                 object
Social Profile          object
Is_verified             object
Delinq_2yrs              int64
Total Payement         float64
Received Principal     float64
Interest Received      float64
Number of loans          int64
Defaulter                int64
dtype: object
# We can use the pandas describe() function to see basic statistics for all numerical columns in our data.
df.describe()
| User_id | Amount | Interest Rate | Tenure(years) | User id_x | Total Income(PA) | User id_y | Dependents | Delinq_2yrs | Total Payement | Received Principal | Interest Received | Number of loans | Defaulter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.437270e+05 | 1.118030e+05 | 143727.000000 | 143727.000000 | 1.437270e+05 | 1.437270e+05 | 1.437270e+05 | 143727.000000 | 143727.000000 | 143727.000000 | 143727.000000 | 143727.000000 | 143727.000000 | 143727.000000 |
| mean | 3.733474e+07 | 1.376458e+05 | 12.038703 | 4.514503 | 3.733474e+07 | 7.242256e+04 | 3.733474e+07 | 1.997113 | 0.276601 | 10840.882025 | 8407.001538 | 2352.729746 | 0.005573 | 0.093712 |
| std | 3.182811e+07 | 1.575427e+05 | 3.880204 | 0.874241 | 3.182811e+07 | 5.539328e+04 | 3.182811e+07 | 1.411317 | 0.789019 | 8613.421536 | 7224.396032 | 2447.759754 | 0.098785 | 0.291429 |
| min | 2.080360e+05 | 0.000000e+00 | 5.420000 | 4.000000 | 2.080360e+05 | 4.000000e+03 | 2.080360e+05 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.276118e+06 | 2.837950e+04 | 9.180000 | 4.000000 | 3.276118e+06 | 4.500000e+04 | 3.276118e+06 | 1.000000 | 0.000000 | 4674.775000 | 3104.160000 | 841.685000 | 0.000000 | 0.000000 |
| 50% | 5.766517e+07 | 7.660300e+04 | 11.840000 | 4.000000 | 5.766517e+07 | 6.180000e+04 | 5.766517e+07 | 2.000000 | 0.000000 | 8212.410000 | 6000.000000 | 1632.240000 | 0.000000 | 0.000000 |
| 75% | 6.729174e+07 | 2.054645e+05 | 14.440000 | 6.000000 | 6.729174e+07 | 8.600000e+04 | 6.729174e+07 | 3.000000 | 0.000000 | 14175.769677 | 11664.365000 | 2904.770000 | 0.000000 | 0.000000 |
| max | 7.895894e+07 | 8.000078e+06 | 23.540000 | 6.000000 | 7.895894e+07 | 7.141778e+06 | 7.895894e+07 | 4.000000 | 22.000000 | 57777.579870 | 35000.010000 | 24205.620000 | 5.000000 | 1.000000 |
# Let's check how many missing values we have in each column of our dataframe
df.isnull().sum()
User_id                   0
Loan Category             0
Amount                31924
Interest Rate             0
Tenure(years)             0
User id_x                 0
Employmet type        84641
Tier of Employment    84642
Industry                  4
Role                      0
Work Experience           4
Total Income(PA)          0
User id_y                 0
Gender                    0
Married               48143
Dependents                0
Home                      0
Pincode                   0
Social Profile        47856
Is_verified           35803
Delinq_2yrs               0
Total Payement            0
Received Principal        0
Interest Received         0
Number of loans           0
Defaulter                 0
dtype: int64
Analysis: Handling Missing Values
# Drop rows with missing values in 'Industry' and 'Work Experience': only a handful of rows are affected,
# 'Industry' is meaningless due to encryption, and 'Work Experience' is stored inconsistently as an
# object-dtype variable, which may hurt model performance.
df = df.dropna(subset=['Industry', 'Work Experience'])
# Call the function to replace null values with "missing" in the following columns
replace_with = 'missing'
columns_to_replace = ['Social Profile', 'Is_verified', 'Married', 'Employmet type']
df = replace_null_values_with_a_value(df, columns_to_replace, replace_with)
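`replace_null_values_with_a_value` comes from the Source helpers; a minimal sketch of what it presumably does (fill NaNs in the given columns with a constant) is:

```python
import pandas as pd

def replace_null_values_with_a_value(df, columns, value):
    """Fill NaNs in the given columns with a constant value."""
    df = df.copy()  # avoid mutating the caller's frame
    df[columns] = df[columns].fillna(value)
    return df

# Usage on a toy frame
toy = pd.DataFrame({"Married": ["Yes", None, "No"]})
toy = replace_null_values_with_a_value(toy, ["Married"], "missing")
print(toy["Married"].tolist())  # ['Yes', 'missing', 'No']
```

Treating "missing" as its own category keeps every row while letting the model learn whether missingness itself is predictive.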
# Create a new indicator "amount_missing": 1 if 'Amount' is null, otherwise 0.
df['amount_missing'] = np.where(df['Amount'].isnull(), 1, 0)
# Replace the null values in the 'Amount' column with "-1000" to differentiate them from the rest of the data.
replace_with = -1000
columns_to_replace = ['Amount']
df = replace_null_values_with_a_value(df, columns_to_replace, replace_with)
# Replace the null values in the 'Tier of Employment' column with the string "Z" to categorize them separately.
replace_with = 'Z'
columns_to_replace = ['Tier of Employment']
df = replace_null_values_with_a_value(df, columns_to_replace, replace_with)
# Check for nulls again to confirm the data is clean and contains no missing values that could impact model performance.
df.isnull().sum()
User_id               0
Loan Category         0
Amount                0
Interest Rate         0
Tenure(years)         0
User id_x             0
Employmet type        0
Tier of Employment    0
Industry              0
Role                  0
Work Experience       0
Total Income(PA)      0
User id_y             0
Gender                0
Married               0
Dependents            0
Home                  0
Pincode               0
Social Profile        0
Is_verified           0
Delinq_2yrs           0
Total Payement        0
Received Principal    0
Interest Received     0
Number of loans       0
Defaulter             0
amount_missing        0
dtype: int64
# Call the function to print the number of unique values in all columns
unique_values_each_column(df)
User_id : 133748
Loan Category : 7
Amount : 86157
Interest Rate : 137
Tenure(years) : 2
User id_x : 133748
Employmet type : 3
Tier of Employment : 8
Industry : 12974
Role : 46
Work Experience : 7
Total Income(PA) : 11380
User id_y : 133748
Gender : 3
Married : 3
Dependents : 5
Home : 5
Pincode : 844
Social Profile : 3
Is_verified : 4
Delinq_2yrs : 22
Total Payement : 104199
Received Principal : 42021
Interest Received : 93859
Number of loans : 5
Defaulter : 2
amount_missing : 2
Observations
# Drop 'Industry', the user id columns, and 'Role': these are identifiers or encrypted values that carry no significant information.
# Drop 'Pincode': the values are masked for privacy (e.g. 'XX852X'), so the column is removed from the dataset.
columns_to_drop = ['Industry', 'User_id', 'User id_x', 'User id_y', 'Pincode', 'Role']
# Call the function to drop columns
drop_columns(df, columns_to_drop)
| Loan Category | Amount | Interest Rate | Tenure(years) | Employmet type | Tier of Employment | Work Experience | Total Income(PA) | Gender | Married | ... | Home | Social Profile | Is_verified | Delinq_2yrs | Total Payement | Received Principal | Interest Received | Number of loans | Defaulter | amount_missing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Consolidation | 55884.0 | 11.84 | 6 | Salaried | B | 1-2 | 125000.0 | Female | Yes | ... | rent | No | missing | 0 | 1824.150000 | 971.46 | 852.69 | 0 | 1 | 0 |
| 1 | Consolidation | 55511.0 | 16.94 | 4 | Self - Employeed | D | 10+ | 61000.0 | Female | No | ... | mortgage | missing | Source Verified | 0 | 22912.532998 | 18000.00 | 4912.53 | 0 | 0 | 0 |
| 2 | Consolidation | 12289.0 | 11.84 | 6 | missing | Z | 5-10 | 100000.0 | Other | missing | ... | own | No | missing | 0 | 7800.440000 | 4489.76 | 3310.68 | 0 | 1 | 0 |
| 3 | Credit Card | 29324.0 | 14.71 | 4 | missing | Z | 2-3 | 30000.0 | Male | missing | ... | rent | Yes | missing | 0 | 6672.050000 | 5212.29 | 1459.76 | 0 | 0 | 0 |
| 4 | Credit Card | 30252.0 | 14.71 | 4 | missing | Z | 10+ | 65000.0 | Male | missing | ... | rent | No | Verified | 0 | 11793.001345 | 10000.00 | 1793.00 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143722 | Credit Card | 287564.0 | 12.06 | 6 | missing | Z | 0 | 100000.0 | Female | No | ... | mortgage | missing | Source Verified | 1 | 12842.400000 | 7193.83 | 5648.57 | 0 | 0 | 0 |
| 143723 | Other | 49612.0 | 14.10 | 4 | missing | Z | 0 | 27000.0 | Male | Yes | ... | rent | No | Source Verified | 0 | 839.200000 | 581.67 | 257.53 | 0 | 0 | 0 |
| 143724 | Consolidation | 13632.0 | 18.96 | 6 | missing | Z | 0 | 53000.0 | Female | Yes | ... | rent | missing | missing | 0 | 6085.800000 | 2434.49 | 3651.31 | 0 | 0 | 0 |
| 143725 | Home | 186923.0 | 8.28 | 4 | missing | Z | 0 | 75400.0 | Other | No | ... | mortgage | missing | Source Verified | 0 | 4080.640000 | 3286.82 | 793.82 | 0 | 0 | 0 |
| 143726 | Consolidation | 181161.0 | 15.87 | 6 | Salaried | D | 0 | 78000.0 | Other | missing | ... | mortgage | Yes | Not Verified | 0 | 10064.160000 | 4702.66 | 5361.50 | 0 | 0 | 0 |
143723 rows × 21 columns
Analysis on Dropping Industry and Pincode
Drop 'Industry': As a non-ordinal categorical variable, we need to address the issue of the high number of categories to prevent the model's dimensionality from becoming too large. Since there is no effective way to group these categories into broader categories, it is recommended to drop this column for the time being.
Convert 'Pincode' to latitude and longitude variables: Considering that 'Pincode' represents location data, it might be beneficial to transform it into latitude and longitude variables. This conversion can provide more meaningful spatial information that can potentially enhance the analysis.
Analysis on Current DataFrame
Employment Type and Tier of Employment: When the employment type is missing, the corresponding tier of employment is also empty. It is important to address these missing values and find an appropriate approach to handle them.
Missing Data in the Married Column: The presence of missing data in the 'Married' column raises questions about how to interpret this missing information. Considering the impact on machine learning models' performance, it is crucial to decide whether to treat it as a new category or explore the reasons behind the missing values.
Empty Social Profiles: Empty social profiles can be treated as a distinct category to account for the missing information.
Is Verified Column: The reason behind the empty values in the 'Is Verified' column is unclear. Assigning a new category to represent these missing values would be a suitable approach.
Removal of rows missing Industry and Work Experience: Since the number of rows with missing values in the 'Industry' and 'Work Experience' columns is minimal (only 4 rows each), removing them is unlikely to have a significant impact. Hence, removing these rows can be considered.
Handling Empty Amounts: Decisions need to be made regarding the rows with missing values in the 'Amount' column. Since the missing values constitute a substantial portion (more than 20%) of the data, careful consideration is required to understand the implications of removing these rows on the data distribution and model performance.
To assess the relationships between variables in the input DataFrame, we compute a correlation matrix, which tells us how strongly each pair of variables is correlated.
# Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model
correlation_heatmap(df)
Observation from the heatmap
No two variables have high correlation with each other, so there is no issue of multicollinearity. It's safe to use all variables in machine learning model building.
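For reference, the matrix behind such a heatmap can be computed directly with pandas; a small illustration on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Amount": [100, 200, 300, 400],
    "Interest Received": [10, 19, 31, 42],
    "Dependents": [3, 1, 4, 1],
})

# Pearson correlation between all numeric column pairs, each value in [-1, 1]
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

A common rule of thumb is to flag multicollinearity when the absolute correlation between two features exceeds roughly 0.8.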
We can confirm the non-linear relationships between features by looking at the pair-wise scatter plots below.
The following code generates a scatter plot matrix, also known as a pair plot, of all numeric features in the input DataFrame using the seaborn library.
The diagonal of the plot matrix shows a histogram of each variable's distribution. This allows for visual inspection of the pairwise relationships between variables, which can be useful for identifying patterns, trends, correlations, or potential outliers in the data.
# Let's plot all interaction scatter plots using seaborn
# Call the function to plot pairwise scatter plots
plot_pairwise_scatter(df)
Analysis
Scatterplots can help us confirm multicollinearity between two variables: if there is any pattern or correlation between a pair, it shows up in the corresponding plot.
Multicollinearity makes explainability less trustworthy: a change in one variable impacts not only the target variable but also the other X variables, so it becomes hard to attribute how much any single variable impacts the target.
# To identify anomalies in each column, it is essential to examine the distributions of the variables in the dataset.
# Call the function to plot histograms
Feature_Distributions_Histogram(df)
Analysis after Distribution:
Amount: The distribution of the 'Amount' variable is right-skewed, indicating that a majority of loan amounts are lower, while a few instances have higher values.
Employment Type: The distribution of the 'Employment Type' variable shows an imbalance, suggesting that certain employment types may be overrepresented in the dataset compared to others.
Work Experience: The 'Work Experience' variable also exhibits imbalanced data, implying that certain levels of work experience may be more prevalent than others.
Pincode: The 'Pincode' variable contains a large number of categories, which may pose challenges for analysis. Considering converting it into latitude and longitude coordinates could offer a more manageable representation.
Delinq_2years: The distribution of the 'Delinq_2years' variable is right-skewed, indicating that most individuals have a low number of delinquencies, while a few have a higher count.
Payment: The 'Payment' variable displays a right-skewed distribution, suggesting that the majority of payment amounts are lower, with a few instances of higher payments.
Received Principal: The distribution of the 'Received Principal' variable is right-skewed, indicating that most individuals have received a lower principal amount, while a few have received a higher amount.
Interest Received: The 'Interest Received' variable exhibits a right-skewed distribution, suggesting that the majority of individuals have received a lower interest amount, while a few have received a higher interest payment.
Let's print skewness in each feature and use log transformation to fix skewness.
Note
It is important to note that there are numerous features in the dataset with a value of 0. To address this issue and normalize the data, we can apply a log transformation specifically to the non-zero values. By taking the logarithm of these values, we can achieve a more symmetric distribution and reduce the impact of extreme values. This transformation can be particularly useful when working with skewed data or variables that exhibit a wide range of values.
Let's transform only the features whose skewness exceeds the threshold used by fix_skewness.
# Add all the features to check and fix skewness in features_log array
features_log = ['Amount', 'Interest Rate', 'Tenure(years)', 'Dependents', 'Total Payement', 'Received Principal', 'Interest Received']
df = fix_skewness(df, features_log)
Initial skewness in feature: Amount is: 4.263131668127465
Final skewness in feature: Amount is: -1.1506200219575988
Initial skewness in feature: Interest Rate is: 0.35701836819248667
Initial skewness in feature: Tenure(years) is: 1.110725712963069
Initial skewness in feature: Dependents is: 0.0009324906324210873
Initial skewness in feature: Total Payement is: 1.56758246496732
Initial skewness in feature: Received Principal is: 1.5217975693302848
Initial skewness in feature: Interest Received is: 2.8360693754096626
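`fix_skewness` is a Source helper whose implementation isn't shown. A hedged sketch consistent with the output above (natural log applied to the positive values when skewness exceeds a threshold; the threshold value here is an assumption, not the project's actual cutoff) is:

```python
import numpy as np
import pandas as pd

def fix_skewness(df, features, threshold=3.0):
    """Log-transform the positive values of each feature whose absolute
    skewness exceeds `threshold` (threshold assumed for illustration)."""
    df = df.copy()
    for col in features:
        skew = df[col].skew()
        print(f"Initial skewness in feature: {col} is: {skew}")
        if abs(skew) > threshold:
            pos = df[col] > 0
            # Log only the positive entries, leaving zeros and sentinels untouched
            df.loc[pos, col] = np.log(df.loc[pos, col])
            print(f"Final skewness in feature: {col} is: {df[col].skew()}")
    return df

# Illustrative usage on a heavily right-skewed toy column
demo = pd.DataFrame({"Amount": [1.0, 2, 2, 3, 5000]})
demo = fix_skewness(demo, ["Amount"], threshold=2.0)
```

Restricting the transform to positive values matches the note above about the many zeros in the data, and avoids taking the log of the -1000 sentinel inserted for missing amounts.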
import pandas as pd
# Add all categorical features for one-hot encoding to the categorical_features array
data = df
categorical_features = ["Gender", "Married", "Home", "Social Profile", "Loan Category", "Employmet type", "Is_verified"]
# Perform one-hot encoding using pandas get_dummies() function
encoded_data = pd.get_dummies(data, columns=categorical_features)
# Define the ordinal categorical features array
ordinal_features = ["Tier of Employment", "Work Experience"]
# Define the pandas DataFrame for encoding
data = encoded_data
# Create a custom mapping of categories to numerical labels
tier_employment_order= list(encoded_data["Tier of Employment"].unique())
tier_employment_order.sort()
work_experience_order = [0, '<1', '1-2', '2-3', '3-5', '5-10', '10+']
custom_mapping = [tier_employment_order, work_experience_order]
# Call the function to perform ordinal encoding
data = perform_ordinal_encoding(data, ordinal_features, custom_mapping)
data
| Amount | Interest Rate | Tenure(years) | Tier of Employment | Work Experience | Total Income(PA) | Dependents | Delinq_2yrs | Total Payement | Received Principal | ... | Loan Category_Home | Loan Category_Medical | Loan Category_Other | Employmet type_Salaried | Employmet type_Self - Employeed | Employmet type_missing | Is_verified_Not Verified | Is_verified_Source Verified | Is_verified_Verified | Is_verified_missing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.931033 | 11.84 | 6 | 1.0 | 2.0 | 125000.0 | 4 | 0 | 1824.150000 | 971.46 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 10.924336 | 16.94 | 4 | 3.0 | 6.0 | 61000.0 | 1 | 0 | 22912.532998 | 18000.00 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 9.416460 | 11.84 | 6 | 7.0 | 5.0 | 100000.0 | 3 | 0 | 7800.440000 | 4489.76 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 10.286162 | 14.71 | 4 | 7.0 | 3.0 | 30000.0 | 1 | 0 | 6672.050000 | 5212.29 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 10.317318 | 14.71 | 4 | 7.0 | 6.0 | 65000.0 | 3 | 0 | 11793.001345 | 10000.00 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143722 | 12.569201 | 12.06 | 6 | 7.0 | 0.0 | 100000.0 | 2 | 1 | 12842.400000 | 7193.83 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 143723 | 10.811988 | 14.10 | 4 | 7.0 | 0.0 | 27000.0 | 3 | 0 | 839.200000 | 581.67 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 143724 | 9.520175 | 18.96 | 6 | 7.0 | 0.0 | 53000.0 | 2 | 0 | 6085.800000 | 2434.49 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 143725 | 12.138452 | 8.28 | 4 | 7.0 | 0.0 | 75400.0 | 0 | 0 | 4080.640000 | 3286.82 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 143726 | 12.107141 | 15.87 | 6 | 3.0 | 0.0 | 78000.0 | 0 | 0 | 10064.160000 | 4702.66 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
143723 rows × 42 columns
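`perform_ordinal_encoding` is a Source helper; a minimal sketch that maps each category to its position in the supplied order (it matches the Work Experience codes seen above, e.g. '1-2' → 2.0, '10+' → 6.0, though the actual helper may differ) is:

```python
import pandas as pd

def perform_ordinal_encoding(df, ordinal_features, custom_mapping):
    """Replace each ordinal feature's categories with their position in
    the corresponding ordered list from custom_mapping."""
    df = df.copy()
    for col, order in zip(ordinal_features, custom_mapping):
        mapping = {category: rank for rank, category in enumerate(order)}
        df[col] = df[col].map(mapping).astype(float)
    return df

# Usage on a toy column with the same ordering as the notebook
toy = pd.DataFrame({"Work Experience": ["<1", "10+", "2-3"]})
order = [0, '<1', '1-2', '2-3', '3-5', '5-10', '10+']
encoded = perform_ordinal_encoding(toy, ["Work Experience"], [order])
print(encoded["Work Experience"].tolist())  # [1.0, 6.0, 3.0]
```

Unlike one-hot encoding, this preserves the rank information in ordered categories such as experience brackets and employment tiers.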
Oversampling
Increase the number of instances in the minority class (defaulters) by duplicating existing examples or generating synthetic examples to achieve a balanced dataset using SMOTE (Synthetic Minority Over-sampling Technique).
# Specify the name of the target variable column
target_column = "Defaulter"
X, y = fix_imbalance_using_oversamping(data, target_column)
import pandas as pd
from imblearn.over_sampling import SMOTE
# Assuming you have your pandas DataFrame df with features and target variable
# Separate the features (X) and target variable (y) from the DataFrame
X = data.drop('Defaulter', axis=1)
y = data['Defaulter']
# Initialize the SMOTE oversampling algorithm
smote = SMOTE(random_state=42)
# Convert X and y to NumPy arrays
X_array = X.values
y_array = y.values
# Perform oversampling on the data
X_resampled, y_resampled = smote.fit_resample(X_array, y_array)
# Convert the resampled arrays back to a pandas DataFrame
X_resampled_df = pd.DataFrame(X_resampled, columns=X.columns)
y_resampled_df = pd.DataFrame(y_resampled, columns=['target'])
# Print the class distribution before and after oversampling
print("Class distribution before oversampling:")
print(y.value_counts())
print("Class distribution after oversampling:")
print(y_resampled_df['target'].value_counts())
X= X_resampled_df
y= y_resampled_df
Class distribution before oversampling:
0    130254
1     13469
Name: Defaulter, dtype: int64
Class distribution after oversampling:
1    130254
0    130254
Name: target, dtype: int64
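Besides SMOTE's synthetic samples, the simpler "duplicating existing examples" option mentioned earlier can be done with pandas alone. A hedged sketch (`random_oversample` is an illustrative name, not a project helper):

```python
import pandas as pd

def random_oversample(df, target_column, random_state=42):
    """Balance classes by resampling minority-class rows with replacement
    until every class matches the majority-class count."""
    counts = df[target_column].value_counts()
    majority = counts.max()
    parts = []
    for cls, n in counts.items():
        subset = df[df[target_column] == cls]
        if n < majority:
            # Duplicate existing minority rows at random
            subset = subset.sample(n=majority, replace=True,
                                   random_state=random_state)
        parts.append(subset)
    return pd.concat(parts, ignore_index=True)

# Toy frame with an 8:2 class imbalance
toy = pd.DataFrame({"f": range(10), "Defaulter": [0] * 8 + [1] * 2})
balanced = random_oversample(toy, "Defaulter")
print(balanced["Defaulter"].value_counts().to_dict())
```

Duplication is cheaper and easier to audit than SMOTE, at the cost of adding no new information about the minority class.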
from sklearn.model_selection import train_test_split
#The test_size parameter is set to 0.2, indicating that 20% of the data will be allocated to the testing set, while the remaining 80% will be used for training.
#The random_state parameter is set to 42 to ensure reproducibility of the split, meaning that the same random split will be obtained each time the code is executed.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
# val_x, test_x, val_y, test_y = train_test_split(test_x, test_y, test_size=0.5, random_state=42)
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import time
from hyperopt import fmin, tpe, hp, STATUS_OK
from hyperopt.pyll import scope
import pickle
from sklearn.metrics import classification_report, confusion_matrix
# Define search space for hyperparameter tuning of XGBoost model.
search_space = {
'learning_rate': hp.loguniform('learning_rate', -7, 0),
'max_depth': scope.int(hp.uniform('max_depth', 1, 100)),
'min_child_weight': hp.loguniform('min_child_weight', -2, 3),
'subsample': hp.uniform('subsample', 0.5, 1),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
'gamma': hp.loguniform('gamma', -10, 10),
'alpha': hp.loguniform('alpha', -10, 10),
'lambda': hp.loguniform('lambda', -10, 10),
'objective': 'binary:logistic',
'eval_metric': 'error',
'seed': 123,
}
# Finding the best hyperparameters using Hyperopt's fmin function.
best_params = fmin(
fn=lambda params: train_model_xgboost(params, train_x, train_y, test_x, test_y),
space=search_space,
algo=tpe.suggest,
max_evals=15,
rstate=np.random.default_rng(123)
)
# Print the params
print(best_params)
100%|██████████████████████████████████████████████| 15/15 [00:20<00:00, 1.38s/trial, best loss: 0.015277724463552262]
{'alpha': 10097.8429974375, 'colsample_bytree': 0.9941458252246209, 'gamma': 4995.254471659089, 'lambda': 6.390134508060656e-05, 'learning_rate': 0.09100767117912448, 'max_depth': 29.65059937692005, 'min_child_weight': 0.1765314436877681, 'subsample': 0.6246406667012787}
# Access the best hyperparameters
best_hyperparams = dict(best_params)
# Train the final XGBoost model with the best hyperparameters
final_model = xgb.XGBClassifier(
    max_depth=int(best_hyperparams['max_depth']),
    learning_rate=best_hyperparams['learning_rate'],
    gamma=best_hyperparams['gamma'],
    subsample=best_hyperparams['subsample'],
    colsample_bytree=best_hyperparams['colsample_bytree'],
    random_state=42,
    tree_method='hist',  # histogram-based tree construction for faster training
    enable_categorical=True,
)
final_model.fit(train_x, train_y)  # Train the final model on the training set
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.9941458252246209, device=None,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric=None, feature_types=None, gamma=4995.254471659089,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.09100767117912448,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=29, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)
# Assuming `test_x` contains your test feature data
# Assuming `test_y` contains your test target labels
from sklearn.metrics import classification_report, confusion_matrix
# Make predictions on the test data
y_pred = final_model.predict(test_x)
# Print classification metrics
print("Classification Report:")
print(classification_report(test_y, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(test_y, y_pred))
Classification Report:
precision recall f1-score support
0 0.85 0.86 0.86 25937
1 0.86 0.85 0.86 26165
accuracy 0.86 52102
macro avg 0.86 0.86 0.86 52102
weighted avg 0.86 0.86 0.86 52102
Confusion Matrix:
[[22296 3641]
[ 3901 22264]]
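As a sanity check, the report's headline numbers can be recomputed by hand from the confusion matrix above (rows are true labels, columns are predictions):

```python
# Confusion matrix layout: [[TN, FP], [FN, TP]]
cm = [[22296, 3641], [3901, 22264]]
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # of predicted defaulters, how many really are
recall = tp / (tp + fn)      # of true defaulters, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.86 0.85 0.86 0.86
```

These match the class-1 row and accuracy line of the classification report above.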
import pickle
filename_xgb = '../Output/xgboost_model.pkl'
# Save the model to the file
pickle.dump(final_model, open(filename_xgb, 'wb'))
# Load the model using the lines below
# filename_xgb = '../Output/xgboost_model.pkl'
# final_model = pickle.load(open(filename_xgb, 'rb'))
# Define your parameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]
}
best_parameters=random_forest_classifier_grid_search(param_grid, train_x, train_y)
best_parameters
Best Parameters:
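`random_forest_classifier_grid_search` is defined in the Source helpers; a minimal sketch assuming it wraps scikit-learn's GridSearchCV (the `cv` parameter and the toy data below are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def random_forest_classifier_grid_search(param_grid, train_x, train_y, cv=3):
    """Exhaustively search param_grid with cross-validation and
    return the best parameter combination found."""
    grid = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid, cv=cv, n_jobs=-1)
    grid.fit(train_x, train_y)
    print("Best Parameters:", grid.best_params_)
    return grid.best_params_

# Tiny usage example (toy data and a reduced grid to keep it fast)
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)
best = random_forest_classifier_grid_search(
    {"n_estimators": [10, 20], "max_depth": [None, 3]}, X_toy, y_toy, cv=2)
```

Unlike the Hyperopt search used for XGBoost, GridSearchCV evaluates every combination in the grid, so cost grows multiplicatively with each added parameter list.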
from sklearn.metrics import classification_report, confusion_matrix
# Access the best hyperparameters
best_hyperparams = dict(best_parameters)
# Train the random forest model with the best hyperparameters
final_model1 = RandomForestClassifier(
    max_depth=best_hyperparams['max_depth'],
    min_samples_split=best_hyperparams['min_samples_split'],
    n_estimators=best_hyperparams['n_estimators'],
)
final_model1.fit(train_x, train_y)  # Train the final model on the training set
# Assuming `test_x` contains your test feature data
# Assuming `test_y` contains your test target labels
# Make predictions on the test data
y_pred = final_model1.predict(test_x)
# Print classification metrics
print("Classification Report:")
print(classification_report(test_y, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(test_y, y_pred))
Classification Report:
precision recall f1-score support
0 1.00 0.99 1.00 25937
1 0.99 1.00 1.00 26165
accuracy 1.00 52102
macro avg 1.00 1.00 1.00 52102
weighted avg 1.00 1.00 1.00 52102
Confusion Matrix:
[[25773 164]
[ 0 26165]]
import pickle
filename = '../Output/RandomForest_model.pkl'
# Save the model to the file
pickle.dump(final_model1, open(filename, 'wb'))
# Load the model using the lines below
# filename = '../Output/RandomForest_model.pkl'
# final_model1 = pickle.load(open(filename, 'rb'))